AUTOLEX: An Automatic Lexicon Builder for Minority Languages Using an Open Corpus
Authors
Abstract
This study aims to build natural language resources for minority languages and other languages with limited resources. Building such resources manually is tedious and costly. The resources, such as language corpora and lexicons, are intended for natural language processing research and system development. Tagalog, a minority language, serves as the test bed. The study uses the World Wide Web to retrieve documents written in a minority language and applies a frequency-based algorithm to build the lexicon. For the evaluation, we used 260 Tagalog documents extracted from the web as the corpus. From this corpus, the system automatically selected 1,386 unique candidate words as lexical entries, based on a threshold value of 10. Each lexical entry was validated by a language expert. The evaluation shows an accuracy of 97.84% and an error rate of 2.16%. The errors were incorrectly spelled words or words that are not Tagalog.
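The frequency-based selection described in the abstract can be illustrated with a minimal Python sketch. It is not the AUTOLEX implementation: the function names `build_lexicon` and `accuracy`, the tokenization rule, and the reading of the threshold as "a word must occur at least 10 times in the corpus" are all assumptions made for illustration. The count of 30 rejected entries used in the usage note is likewise inferred from the reported percentages, not stated in the abstract.

```python
import re
from collections import Counter

# Hypothetical tokenizer: runs of letters, optionally joined by a hyphen
# or apostrophe (Tagalog uses hyphenated forms such as "mag-aral").
WORD_PATTERN = re.compile(r"[^\W\d_]+(?:[-'][^\W\d_]+)*")


def build_lexicon(documents, threshold=10):
    """Return the sorted unique words whose corpus frequency is >= threshold.

    Assumption: the threshold is a minimum total count over all documents.
    The abstract only says a threshold "with value of 10" was used.
    """
    counts = Counter()
    for text in documents:
        counts.update(WORD_PATTERN.findall(text.lower()))
    return sorted(word for word, n in counts.items() if n >= threshold)


def accuracy(n_entries, n_rejected):
    """Percentage of lexical entries accepted by the validating expert."""
    return 100.0 * (n_entries - n_rejected) / n_entries
```

For example, `build_lexicon(["Ang bata ay mabait. Ang aso ay mabait."] * 5)` keeps only the words that reach ten occurrences. If roughly 30 of the 1,386 entries were rejected, `round(accuracy(1386, 30), 2)` gives 97.84, which agrees with the reported accuracy.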
Related resources
Adaptation of the F-measure to Cluster Based Lexicon Quality Evaluation
An external lexicon quality measure called the L-measure is derived from the F-measure (Rijsbergen, 1979; Larsen and Aone, 1999). The typically small sample sizes available for minority languages and the evaluation of Semitic language lexicons are two main factors considered. Large-scale evaluation results for the Maltilex Corpus are presented (Rosner et
Using speech recognition technique for constructing a phonetically transcribed Taiwanese (Min-Nan) text corpus
Collection of Taiwanese text corpus with phonetic transcription suffers from the problems of multiple pronunciation variation. By augmenting the text with speech, and using automatic speech recognition with a sausage searching net constructed from the multiple pronunciations of the text corresponding to its speech utterance, we are able to reduce the effort for phonetic transcription. By using ...
The Trilingual ALLEGRA Corpus: Presentation and Possible Use for Lexicon Induction
In this paper, we present a trilingual parallel corpus for German, Italian and Romansh, a Swiss minority language spoken in the canton of Grisons. The corpus called ALLEGRA contains press releases automatically gathered from the website of the cantonal administration of Grisons. Texts have been preprocessed and aligned with a current state-of-the-art sentence aligner. The corpus is one of the f...
Investigating automatic decomposition for ASR in less represented languages
This paper addresses the use of an automatic decomposition method to reduce lexical variety and thereby improve speech recognition of less well-represented languages. The Amharic language has been selected for these experiments since only a small quantity of resources are available compared to well-covered languages. Inspired by the Harris algorithm, the method automatically generates plausible...
Data Driven Approaches to Phonetic Transcription with Integration of Automatic Speech Recognition and Grapheme-to-Phoneme for Spoken Buddhist Sutra
We propose a new approach for performing phonetic transcription of text that utilizes automatic speech recognition (ASR) to help traditional grapheme-to-phoneme (G2P) techniques. This approach was applied to transcribe Chinese text into Taiwanese phonetic symbols. By augmenting the text with speech and using automatic speech recognition with a sausage searching net constructed from multiple pro...
Publication date: 2010